Bayesian Logistic Regression

Diabetes dataset

Let us analyze the Pima Indians Diabetes Database: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least $21$ years old of Pima Indian heritage.

This dataset contains $768$ observations; $268$ of the $768$ patients are diabetic.

Target variable:

- `Outcome` ($1$ = diabetic, $0$ = non-diabetic)

Predictors:

- `Pregnancies`, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`, `DiabetesPedigreeFunction`, `Age`

MCMC approach

Bayesian Logistic Regression model: $$ t_i | \mathbf x_i, \mathbf w \sim \mathrm{Be}(\theta(\mathbf x_i)) \\ \theta(\mathbf x) = \mathrm{logit}^{-1}(\mathbf w^T \mathbf x) \\ \mathbf w_j \sim \mathcal N(0, S_j^2) \\ S_0 = 10;\ S_k = 2, \ k = \overline{1,8} $$

where $\mathrm{logit}^{-1}(z) = \frac{1}{1 + e^{-z}}$ is the logistic sigmoid.
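The model above corresponds to the following unnormalized log-posterior. This is a minimal NumPy sketch (the function name and array layout are assumptions, and the actual MCMC sampler is not shown here); the weight vector includes the intercept $\mathbf w_0$ as its first component.

```python
import numpy as np

def log_posterior(w, X, t):
    """Unnormalized log-posterior of the Bayesian logistic regression above.

    w : (9,) weights, w[0] is the intercept
    X : (n, 9) design matrix with a leading column of ones
    t : (n,) binary targets
    """
    S = np.array([10.0] + [2.0] * 8)          # prior scales: S_0 = 10, S_k = 2
    z = X @ w
    # Bernoulli log-likelihood: t*z - log(1 + e^z), numerically stable form
    log_lik = np.sum(t * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * np.sum((w / S) ** 2)   # N(0, S_j^2), up to a constant
    return log_lik + log_prior
```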

We now have $4$ chains of length $2000$ for every $\mathbf w_j,\ j=\overline{0,8}$.

Let's analyze whether the chains have converged to a stationary distribution.

The graphs below show that the chains are indistinguishable. That is a good sign!

There are no problems with autocorrelation.

All $\hat{R}_j,\ j=\overline{0,8}$, are very close to $1$, and the effective sample sizes are also good.
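For reference, the $\hat R$ statistic can be computed directly from the chain array. A minimal NumPy sketch of the classic Gelman-Rubin formula (in practice a library such as ArviZ, with its split and rank-normalized variant, would be used):

```python
import numpy as np

def r_hat(chains):
    """Gelman-Rubin R-hat for an (n_chains, n_draws) array of samples."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_plus / W)
```

Values close to $1$ indicate that the chains are sampling the same distribution.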

Let's take a look at the posterior $\mathbf w_j$ distributions.

So we have weight samples for the normalized data. Now we will calculate the weights for the original data.

Let $\mathbf x^o$ denote the normalized data and $\mathbf w$ the weights for the normalized data; let $\mathbf x$ denote the original data and $w$ the weights for the original data.

$$\theta(\mathbf x) = \mathrm{logit}^{-1}(\mathbf w^T \mathbf x^o) \\ \mathbf w^T \mathbf x^o = \mathbf w_0 + \sum\limits_{i=1}^n \mathbf w_i x_i^{o} = \mathbf w_0 + \mathbf w_1 \frac{\mathbf x_1 - m_1}{\sigma_1} + \ldots + \mathbf w_n \frac{\mathbf x_n - m_n}{\sigma_n} = \mathbf w_0 - \sum\limits_{i=1}^n \mathbf w_i \frac{m_i}{\sigma_i} + \frac{\mathbf w_1}{\sigma_1} \mathbf x_1 + \ldots + \frac{\mathbf w_n}{\sigma_n} \mathbf x_n = w_0 + \sum\limits_{i=1}^n w_i \mathbf x_i = w^T \mathbf x $$

where $w_0=\mathbf w_0 - \sum\limits_{i=1}^n \mathbf w_i \frac{m_i}{\sigma_i}$; $w_i=\frac{\mathbf w_i}{\sigma_i},\ i=\overline{1,n}$.
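This transformation can be applied to every posterior sample at once. A sketch, assuming the weight samples are stored row-wise with the intercept in the first column (the function name and layout are assumptions):

```python
import numpy as np

def to_original_scale(W_norm, means, sigmas):
    """Convert weight samples for standardized predictors to the original scale.

    W_norm : (n_samples, n_features + 1) weights, column 0 is the intercept
    means, sigmas : (n_features,) standardization parameters m_i, sigma_i
    """
    W = np.empty_like(W_norm)
    W[:, 1:] = W_norm[:, 1:] / sigmas                           # w_i = w_i / sigma_i
    W[:, 0] = W_norm[:, 0] - W_norm[:, 1:] @ (means / sigmas)   # shifted intercept
    return W
```

The defining property is that both parameterizations give the same linear predictor on corresponding data points.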

Now we will calculate predicted probabilities for all $768$ objects in the dataset. First, we apply each of the $8000$ models to every object, giving $8000$ probabilities per object. To get a single predicted probability for each object, we average the predictions of this ensemble of $8000$ models.
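The averaging step above can be sketched as follows (a minimal NumPy version; the function name and array shapes are assumptions):

```python
import numpy as np

def predicted_probabilities(W, X):
    """Average sigmoid predictions over posterior weight samples.

    W : (n_samples, d + 1) posterior weight samples, column 0 is the intercept
    X : (n_objects, d) data on the same scale as the weights
    Returns (n_objects,) mean predicted probability per object.
    """
    Z = np.hstack([np.ones((X.shape[0], 1)), X]) @ W.T   # (n_objects, n_samples)
    theta = 1.0 / (1.0 + np.exp(-Z))                     # one prediction per model
    return theta.mean(axis=1)                            # average over the ensemble
```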

Let's evaluate the adequacy of the model on the training data. To do this, we fix a trivial cutoff level of $0.5$, so that $\{t=1\} \Longleftrightarrow \{\theta(\mathbf x) > 0.5\}$.
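With the cutoff fixed, accuracy is just the share of objects where the thresholded prediction matches the target (a small sketch; the function name is an assumption):

```python
import numpy as np

def accuracy(theta, t, cutoff=0.5):
    """Share of objects whose thresholded prediction equals the target."""
    return np.mean((theta > cutoff).astype(int) == t)
```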

An accuracy of $78.26\%$ is acceptable.

Prediction of new data

Now let's turn to the Internet for typical indicator values of a healthy and of a diabetic person.

Let $\theta(\mathbf x)$ be the probability of diabetes for a patient with features $\mathbf x$.
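Because we keep the full set of posterior weight samples, $\theta(\mathbf x)$ for a fixed patient is itself a distribution: one value per sampled model. A sketch of how these distributions are obtained (names and shapes are assumptions):

```python
import numpy as np

def theta_distribution(W, x):
    """Posterior distribution of theta(x) for a single patient.

    W : (n_samples, d + 1) posterior weight samples, column 0 is the intercept
    x : (d,) patient features on the same scale as the weights
    Returns (n_samples,) theta values, one per posterior sample.
    """
    z = W[:, 0] + W[:, 1:] @ x           # linear predictor per sample
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid per sample
```

The resulting samples can then be plotted as a histogram for each patient.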

Distribution of $\theta$ for a healthy patient.

Distribution of $\theta$ for a diabetic patient.